library(gapminder)
mutate(gapminder, v3 = pop/10^6)
filter(gapminder, pop > 10)Lecture 4: Debugging, and data visualization with tidyverse!
CSSS 508
Updates and reminders
- Homework and peer reviews (including for HW2) will be graded for completion
- This is a one-credit course and I want the focus to be on practice, not points
- Reminder: this week’s office hour is in Thomson, not Padelford (see website for details)
- Does the office hour time work for people?
Outline for today?
First half of today:
- Quick review of topics that came up on Ed Discussion
- Practice with debugging
Second half of today:
- Pivoting tables and data
- Introduce plots with tidyverse!
Quick review
Code chunk options
There are several code chunk options, including echo, include, and eval. You can review them in the Quarto documentation here.
Which chunk option do we use…
- to show/hide the code in the rendered html document?
- to run/not run the code?
- to run the code but not show any output or messages from it?
Which chunk option(s) should we use when we load packages at the beginning of the Quarto document?
Which chunk option(s) should we use if we want to show the code we used to make a graph, but we don’t want to actually run it (we might do this in a tutorial, like to show how to install a package)?
Formatting text to look like code
To do this, surround the text with backticks (`): `mutate()` will become mutate().
Debugging
General principles and tips:
- Test your code/render your document frequently so that any new errors must be coming from a small set of recent changes you made
- Keep calm and be patient with yourself and with R
- Break the problem down into testable pieces, ruling out possible explanations one at a time
- Read and maybe google the error messages
- Some are more useful to google than others; you get more intuition for this over time
- Identify spots in the code that are likely causing the issue
- Helpful to test/render frequently
- Naming your Quarto code chunks can help to identify the spot that’s causing the issue (we’ll talk about this today)
- Look for clues from the error messages you get, like
object XXX not foundor an issue with a particular function
- Change one of the possible problems, and test if you still get the error
- Read and maybe google the error messages
- Step through your code line by line, thinking about what is defined so far and what the result of each line is
Here’s an example that brings together a bunch of different concepts we’ve seen. Try this in small groups, then we’ll go over it as a class.
I tried to write code to convert the population to millions and filter for just the rows with a population of at least 10 million. Why isn’t it working? How would you modify the code to get it working?
Note:
- You’ll probably have to debug this code in multiple steps
- Just because the final result doesn’t print an error message doesn’t guarantee it’s working; make sure to check whether at the end you have the result you wanted
How to read a Quarto rendering error message
When you try to render a Quarto document and it fails, you get an error message something like this:
Let’s break this down.
- The thing with the arrow
==>at the top is the command that RStudio runs when you click the Render button. - Then it says it’s processing your file, and it shows some of the progress it made through rendering
- The “unnamed-chunk-2” refers to a code chunk in your code
- It says unnamed because we didn’t name the chunk (we hadn’t talked about how to do that anyway). If you had named it, the name would appear there instead of “unnamed-chunk-2”.
- The actual error message starts with “Quitting from ….”
- The part that tells you something about what’s happening is the error message itself:
object 'true' not found.This suggests that somewhere you typed a “true” where it doesn’t know what to do with it.
- The part that tells you something about what’s happening is the error message itself:
Naming chunks
If I render the following Quarto document:
I get the following error message:
If I take the same Quarto document and just name my code chunks like so:
Then I get somewhat more helpful output about the progress before the error:
Make a Quarto document that looks like the second one above, with the named chunks.
Then identify and fix the error(s) to get it rendering successfully.
Render frequently!
Rule number 1: render frequently! But of course sometimes you forget, or you have other reasons that you add a lot of code or make a lot of code before rendering. Let’s talk about how to troubleshoot in that case: sequentially cut out code and try rendering again.
- Cut (as in cut and paste, not as in delete) everything after the first code chunk.
- Try to render the document again.
- If it doesn’t render successfully, you know that at least the first issue is in the small document you just tried to render: the YAML header, some Markdown text/formatting, or the first code chunk.
- If it renders without errors, then the issue must be happening somewhere in the part that you cut. Paste it back, and now repeat step 1 but cut everything after the second code chunk. And so on.
Once you identify what line is triggering the error message, then you can work on figuring out what caused the error. Remember: this is not necessarily the code you want to edit. The part of the code you modify to fix the issue could be in an earlier line of code or an earlier code chunk.
In fact let’s practice that right now! In the qmd file below,
- Which line is generating the error message? (To which line is the error message referring?)
- Where would you edit the code to fix the error?
Pivoting tables and data
Recall where we left off last time with data manipulation in tidyverse: We wanted to summarize how many observations we had for each year and each continent, so we made the following table.
n_obs_by_year_cont <- gapminder %>%
group_by(year, continent) %>%
summarize(n.obs = n())
kable(head(n_obs_by_year_cont, n=12))| year | continent | n.obs |
|---|---|---|
| 1952 | Africa | 52 |
| 1952 | Americas | 25 |
| 1952 | Asia | 33 |
| 1952 | Europe | 30 |
| 1952 | Oceania | 2 |
| 1957 | Africa | 52 |
| 1957 | Americas | 25 |
| 1957 | Asia | 33 |
| 1957 | Europe | 30 |
| 1957 | Oceania | 2 |
| 1962 | Africa | 52 |
| 1962 | Americas | 25 |
This table has a separate row for each year-continent combination. What we would probably find more readable is to have each row represent a year (or continent) and each column represent a continent (or year), and then the values currently in the n.obs column would become the entries in each cell of the table:
| year | Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|---|
| 1952 | 52 | 25 | 33 | 30 | 2 |
| 1957 | 52 | 25 | 33 | 30 | 2 |
| 1962 | 52 | 25 | 33 | 30 | 2 |
| 1967 | 52 | 25 | 33 | 30 | 2 |
| 1972 | 52 | 25 | 33 | 30 | 2 |
| 1977 | 52 | 25 | 33 | 30 | 2 |
| 1982 | 52 | 25 | 33 | 30 | 2 |
| 1987 | 52 | 25 | 33 | 30 | 2 |
| 1992 | 52 | 25 | 33 | 30 | 2 |
| 1997 | 52 | 25 | 33 | 30 | 2 |
| 2002 | 52 | 25 | 33 | 30 | 2 |
| 2007 | 52 | 25 | 33 | 30 | 2 |
We say that n_obs_by_year_cont is currently in long format, and we would like to pivot to a wider format. To do that, we’ll use the tidyverse function pivot_wider():
# we start with n_obs_by_year_cont
table_year_cont = n_obs_by_year_cont %>%
pivot_wider(
# get the names for the new columns from the continent column
names_from = continent,
# get the values for the new columns from the n.obs column
values_from = n.obs
)Sometimes we have a wide-format table that we want to pivot longer. This will come up later as we’re plotting.
Consider that last code chunk, copied and pasted here for clarity:
# we start with n_obs_by_year_cont
table_year_cont = n_obs_by_year_cont %>%
pivot_wider(
# get the names for the new columns from the continent column
names_from = continent,
# get the values for the new columns from the n.obs column
values_from = n.obs
)How would we change this code so that rows are continents and columns are years?
What other code would we have to run before this code in order for it to work? In other words, if we open a fresh R Session, what is the minimum set of code we would need to run first so that we could create
table_year_contwithout errors? Or equivalently, if we put the above code in a Quarto document, what is the minimum set of code we would need to put in the Quarto document before this code chunk so that the Quarto document would render without errors?
Data viz
Now for some plotting.
Not that kind of plotting.
Getting started with ggplot()
By popular demand, let’s make a plot of population over time for the country of Japan.
First, review: how do you get the subset of the gapminder dataset that consists of just the Japan data?
# finish this code...
japan_data =Cool, now let’s plot population versus time with ggplot():
Weird, what do you notice? ggplot() is funny in that the first line which actually has the ggplot function only declares the initial plot area; it doesn’t make the full plot. To do that, we add a + at the end of the ggplot() line and add additional lines of code. For example, “geoms” (geometry layers) add the actual lines, points, bars, etc.:
We can combine multiple layers too as long as they make sense for the data structure:
We can make things a lot prettier and more customized too. Here are just a few examples of things we can do:
Plotting multiple countries
What if we want to compare several countries on the same plot? We might try the following code, but it looks bonkers. Try it and see:
multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))
ggplot(multi_country_data, aes(x = year, y = pop)) +
geom_line()Why does the plot look like that, and how can we fix it? As part of troubleshooting, we might check what a scatterplot looks like (without connecting the lines):
Notice that looks reasonable. So what’s the issue with the line plot?
The issue is that the multi_country_data dataset has both multiple countries and multiple years, and currently ggplot does not know to group the time series data by country when it is deciding how to connect the points with lines. It is treating it as data to plot as a single line, when we would like a separate line for each country.
To fix this, we will specify a grouping variable using the group argument to the aes() function:
multi_country_data = filter(gapminder, country %in% c("Japan", "Nigeria", "Argentina", "New Zealand"))
ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
geom_line()We probably also want to distinguish and label the lines somehow by what country they represent:
Facets (subplots)
What if we want to plot each country in a different subplot so that we can see each curve on its own scale? We can use facet_wrap():
ggplot(multi_country_data, aes(x = year, y = pop, group = country)) +
geom_line() +
facet_wrap(~ country, scales = "free")Note that since they all share an x-axis (years), it might make sense to plot them vertically stacked so that the years line up. For this, we can use facet_grid() which allows us to arrange the plots specifically in a row or a column:
Pivoting for plots
What if we want to plot one country but three different variables: lifeExp, pop, and gdpPercap? We’d basically like to group by variable, but to do that, it has to be a column of the dataset. As in we need a column whose values are “lifeExp”, “pop”, and “gdpPercap” (or some other names for these three quantities).
To do this… yes, we will pivot longer!
japan_wide = japan_data %>%
pivot_longer(
cols = lifeExp:gdpPercap,
names_to = "variable",
values_to = "value"
)
ggplot(japan_wide, aes(x = year, y = value, group = variable)) +
geom_line() +
facet_wrap(~ variable, scales = "free")Again, we can use facet_grid() to stack the plots so they align by year:
ggplot(japan_wide, aes(x = year, y = value, group = variable)) +
geom_line() +
facet_grid(variable ~ ., scales = "free")To see what else you can do with ggplot, check out further documentation online such as the ggplot gallery and the ggplot vignettes or “articles”. Happy exploring!